Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strobealign-aemb module for metagenomic binning #394

Merged
merged 29 commits into from
Mar 4, 2024
Merged

Conversation

psj1997
Copy link
Contributor

@psj1997 psj1997 commented Feb 13, 2024

Hi,

This is the aemb (abundance estimation for metagenomic binning) module, the usage is :

strobealign ref.fa read1.fq read2.fq --aemb > abun.txt

In my benchmarking, aemb can output similar results, but decrease more than 90% running time.

Sincerely
Shaojun

@ksahlin
Copy link
Owner

ksahlin commented Feb 14, 2024

Thanks Shaojun. I think Marcel is a bit busy right now but I am sure he will have a look when there is sufficient time.

Copy link
Collaborator

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, here’s a partial review. Some additional comments:

  • Tests are needed
  • Documentation needs to be added (including a description of the file format).
  • Please ensure that --aemb and -x cannot be used at the same time.

src/aln.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
src/aln.cpp Outdated Show resolved Hide resolved
src/aln.cpp Outdated Show resolved Hide resolved
src/aln.cpp Outdated Show resolved Hide resolved
src/aln.cpp Outdated
std::vector<float> &abundance,
int read1_len,
int read2_len,
bool output_abundance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of modifying this function, I believe it would be better to just add a new function, maybe named get_abundances or so. You can copy the get_best_map_location function and adjust it as necessary. That will add some duplication, but that can be fixed later.

Please make it so that the function returns the abundances.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make it so that the function returns the abundances.

I take that back; that would be less efficient. It is ok to have this as a mutable in/out argument.

src/cmdline.cpp Outdated Show resolved Hide resolved
src/cmdline.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some more comments. Also, this needs documentation and tests.

src/aln.cpp Outdated
std::vector<float> &abundance,
int read1_len,
int read2_len,
bool output_abundance
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make it so that the function returns the abundances.

I take that back; that would be less efficient. It is ok to have this as a mutable in/out argument.

src/aln.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
src/main.cpp Outdated Show resolved Hide resolved
src/aln.cpp Outdated Show resolved Hide resolved
src/aln.cpp Outdated Show resolved Hide resolved
psj1997 and others added 26 commits February 20, 2024 22:48
This way, the strobealign sources are more self-contained and it should not
be necessary to have internet access at build time.
See https://github.com/alugowski/poolSTL

Sorting the randstrobes is currently a bottleneck in index generation as it
does not run in parallel. This is an attempt to parallelize it.

poolstl’s sort uses regular std::sort under the hood.
We currently use pdqsort_branchless, which is about twice as fast as
std::sort, so parallel sorting breaks even with pdqsort_branchless at about
2-3 threads. It gets faster with more threads, but not as much as one would
perhaps expect. Here are the sorting runtimes for CHM13:

- 31 s with pdqsort_branchless
- 59 s with std::sort
- 34 s with parallel sort, 2 threads
- 24 s with parallel sort, 4 threads
- 23 s with parallel sort, 8 threads

Another issue is that sorting is no longer in place, so memory usage goes
up by a couple of gigabytes, which is another reason for me not to make this
change.
This version indeed gives us a very nice speed improvement without using
extra memory:

Sorting-only runtimes:
* 1 thread: 32.8 s
* 2 threads: 20.3 s
* 4 threads: 15.7 s
* 8 threads: 14.8 s

Overall indexing runtimes (before/after):
* 1 thread: 151 s → 153 s
* 2 threads: 100 s → 88 s
* 4 threads: 73 s → 57 s
* 8 threads: 63 s → 47 s
... by sorting them also by position. This way, it does not matter in which
way the randstrobes vector is partitioned during sort. Otherwise, the order
would depend on the number of threads used to create the index, making
mapping results not reproducible across runs that do not use the same no. of
threads.

Note: There is still a tiny chance for collisions/nondeterminism because we
ignore RefRandstrobe::m_packed.

For efficiency, this uses a branchless comparison function inspired by
@alugowski’s comment in PR ksahlin#386.

Runtimes for sorting with one thread

32 s - only by hash
45 s - by hash and position using std::tie
42 s - by hash and position using branchless_compare from the PR
35 s - by hash and position using __uint128_t (this commit)

Runtimes for sorting (four cores with hyperthreading)

threads | sorting time | index creation time
-|-|-
1 | 35 s | 154 s
2 | 25 s | 95 s
4 | 16 s | 60 s
8 | 15 s | 48 s
Right now, strobealign only supports up to 2²⁴ sequences. If the user
tries more, it would silently accept it, but later crash.

This was triggered when trying to map to the Greengenes database

https://ftp.microbio.me/greengenes_release/2022.10/
https://ftp.microbio.me/greengenes_release/2022.10/2022.10.seqs.fna.gz

Even a single read like the one below trigger a crash

```
@M05314:127:000000000-BWLLJ:1:1101:15267:1654 2:N:0:1
CCTGTTCGCTCCCCACGCTTTCGTCCCTCAGCGTCAATATTGTGCCAGAATGCTGCCTTCGCCATTGGTGTTCCTCCTGATATCTACGCATGTCACCGCTACACCAGGAATTCCACATTCCTCTCACATATTCTATTTTATCAGTTTTGAT
+
AAA1AF@1>AAAGG1A0EAFGGEHAAEGFCG1AAEE/F2FG2F2FF1CA0FBDED1BGFGFFE?AF1BFFCFHDGFFHB1FFGFGEEFE/?/BF2F@/EGEEB00/0//0BFG1>B1BGFEFHHGGFFD12BGH2FDFFFGG22GDD>@/F
```
This is 3-state variable: SAM/PAF/Abundance
Copy link
Collaborator

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two more tiny comments. Also, I will probably rearrange the documentation a little bit (clarifying the different output modes a bit) after this has been merged, but it’s probably easiest if it’s not part of this PR. I’ll also add a changelog entry later.

src/aln.cpp Outdated
std::make_pair(std::cref(nams2), read2_len) }) {
size_t best_score = 0;
// We loop twice because we need to count the number of NAMs with best score
for (auto &t : nams) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rename t to nam?

(Also below in align_or_map_single)

src/main.cpp Outdated
@@ -105,6 +105,12 @@ InputBuffer get_input_buffer(const CommandLineOptions& opt) {
}
}

void output_abundance(std::vector<double> abundances, References references){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both parameters should be passed by const reference

src/main.cpp Outdated Show resolved Hide resolved
Co-authored-by: Marcel Martin <[email protected]>
@marcelm
Copy link
Collaborator

marcelm commented Mar 4, 2024

This looks good now! I think this deserves a slightly longer section in the documentation, but I’ll do that after merging.

@marcelm marcelm merged commit bd183f5 into ksahlin:main Mar 4, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants